{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Heuristic Benchmark\n",
    "\n",
    "This notebook demonstrates the Heuristic Benchmark design pattern\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Regression on poorly understood features\n",
    "\n",
    "Problem: Time interval before a question on Stack Overflow is answered.\n",
    "\n",
    "Benchmark: Median time to first answer over the entire training dataset, so 2120 seconds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>time_to_answer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2120.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   time_to_answer\n",
       "0          2120.0"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT \n",
    "  bqutil.fn.median(ARRAY_AGG(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND))) AS time_to_answer\n",
    "FROM `bigquery-public-data.stackoverflow.posts_questions` q\n",
    "JOIN `bigquery-public-data.stackoverflow.posts_answers` a\n",
    "ON q.accepted_answer_id = a.id"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the error metric of always predicting that it will take 2120 seconds to get an answer. This the baseline metric against which to report model performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_absolute_error</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>857315.119106</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   mean_absolute_error\n",
       "0        857315.119106"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "WITH benchmark_eval AS (\n",
    "SELECT \n",
    "  2120 - TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND) AS error\n",
    "FROM `bigquery-public-data.stackoverflow.posts_questions` q\n",
    "JOIN `bigquery-public-data.stackoverflow.posts_answers` a\n",
    "ON q.accepted_answer_id = a.id\n",
    ")\n",
    "\n",
    "SELECT\n",
    "   AVG(ABS(error)) AS mean_absolute_error\n",
    "FROM\n",
    "   benchmark_eval"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Classification on poorly understood features\n",
    "\n",
    "Problem: Whether or not an accepted answer will be edited.\n",
    "\n",
    "Benchmark: Probability distribution of accepted answers that are edited."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>prob_edited</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.36226</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   prob_edited\n",
       "0      0.36226"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT \n",
    "  AVG(IF(a.last_edit_date IS NULL, 0, 1)) AS prob_edited\n",
    "FROM `bigquery-public-data.stackoverflow.posts_questions` q\n",
    "JOIN `bigquery-public-data.stackoverflow.posts_answers` a\n",
    "ON q.accepted_answer_id = a.id"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Problem: Country from which a Stack Overflow question will be answered.\n",
    "\n",
    "Benchmark: Fractions of answers written by people from France, India, and so on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>from_france</th>\n",
       "      <th>from_india</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.029717</td>\n",
       "      <td>0.08415</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   from_france  from_india\n",
       "0     0.029717     0.08415"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "SELECT \n",
    "  COUNTIF(ENDS_WITH(u.location, 'France')) / COUNT(u.location) AS from_france,\n",
    "  COUNTIF(ENDS_WITH(u.location, 'India')) / COUNT(u.location) AS from_india\n",
    "FROM `bigquery-public-data.stackoverflow.posts_questions` q\n",
    "JOIN `bigquery-public-data.stackoverflow.posts_answers` a\n",
    "ON q.accepted_answer_id = a.id\n",
    "JOIN `bigquery-public-data.stackoverflow.users` u\n",
    "ON u.id = a.owner_user_id"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Regression with one good numeric feature\n",
    "\n",
    "\n",
    "Problem: Predict taxi fare amount given pickup and dropoff locations.\n",
    "The distance between the two points is, intuitively, a key feature.\n",
    "\n",
    "Benchmark: linear regression based on this feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>f0_</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4.644356</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        f0_\n",
       "0  4.644356"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "With trips AS (\n",
    "SELECT\n",
    "  total_amount,\n",
    "  ST_Distance(ST_GeogPoint(pickup_longitude, pickup_latitude),\n",
    "              ST_GeogPoint(dropoff_longitude, dropoff_latitude))/1000 AS dist\n",
    "FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015`\n",
    "WHERE pickup_latitude BETWEEN 35 and 45\n",
    "AND dropoff_latitude BETWEEN 35 and 45\n",
    "AND pickup_longitude BETWEEN -80 and -70\n",
    "AND dropoff_longitude BETWEEN -80 and -70\n",
    "AND total_amount IS NOT NULL\n",
    ")\n",
    "\n",
    "SELECT AVG(total_amount)/AVG(dist)\n",
    "FROM trips"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Regression with one or two important features\n",
    "\n",
    "Problem: Predict duration of bicycle rental.\n",
    "    \n",
    "Benchmark: Lookup table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>start_station_name</th>\n",
       "      <th>is_peak</th>\n",
       "      <th>predicted_duration</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Contact Centre, Southbury House</td>\n",
       "      <td>False</td>\n",
       "      <td>7012.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Stewart's Road, Nine Elms</td>\n",
       "      <td>False</td>\n",
       "      <td>6401.018182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Speakers' Corner 2, Hyde Park</td>\n",
       "      <td>True</td>\n",
       "      <td>4455.441717</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Speakers' Corner 2, Hyde Park</td>\n",
       "      <td>False</td>\n",
       "      <td>3785.754375</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Speakers' Corner 1, Hyde Park</td>\n",
       "      <td>True</td>\n",
       "      <td>3728.008525</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Stewart's Road, Nine Elms</td>\n",
       "      <td>True</td>\n",
       "      <td>3727.422680</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Speakers' Corner 1, Hyde Park</td>\n",
       "      <td>False</td>\n",
       "      <td>3702.115147</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Black Lion Gate, Kensington Gardens</td>\n",
       "      <td>True</td>\n",
       "      <td>3653.733728</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Black Lion Gate, Kensington Gardens</td>\n",
       "      <td>False</td>\n",
       "      <td>3552.613008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Mechanical Workshop Penton</td>\n",
       "      <td>True</td>\n",
       "      <td>3533.424658</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    start_station_name  is_peak  predicted_duration\n",
       "0      Contact Centre, Southbury House    False         7012.500000\n",
       "1            Stewart's Road, Nine Elms    False         6401.018182\n",
       "2        Speakers' Corner 2, Hyde Park     True         4455.441717\n",
       "3        Speakers' Corner 2, Hyde Park    False         3785.754375\n",
       "4        Speakers' Corner 1, Hyde Park     True         3728.008525\n",
       "5            Stewart's Road, Nine Elms     True         3727.422680\n",
       "6        Speakers' Corner 1, Hyde Park    False         3702.115147\n",
       "7  Black Lion Gate, Kensington Gardens     True         3653.733728\n",
       "8  Black Lion Gate, Kensington Gardens    False         3552.613008\n",
       "9           Mechanical Workshop Penton     True         3533.424658"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE TEMPORARY FUNCTION is_peak_hour(start_date TIMESTAMP) aS\n",
    "(EXTRACT(DAYOFWEEK FROM start_date) BETWEEN 2 AND 6 -- weekday\n",
    "    AND (\n",
    "       EXTRACT(HOUR FROM start_date) BETWEEN 6 AND 10\n",
    "       OR\n",
    "       EXTRACT(HOUR FROM start_date) BETWEEN 15 AND 18))\n",
    ";\n",
    "\n",
    "SELECT \n",
    "   start_station_name,\n",
    "   is_peak_hour(start_date) AS is_peak,\n",
    "   AVG(duration) AS predicted_duration,\n",
    "FROM `bigquery-public-data.london_bicycles.cycle_hire`\n",
    "GROUP BY 1, 2\n",
    "ORDER BY predicted_duration DESC\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, use this benchmark to compute the overall RMSE, so that you can compare with the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rmse</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>9814.442983</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          rmse\n",
       "0  9814.442983"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE TEMPORARY FUNCTION is_peak_hour(start_date TIMESTAMP) aS\n",
    "(EXTRACT(DAYOFWEEK FROM start_date) BETWEEN 2 AND 6 -- weekday\n",
    "    AND (\n",
    "       EXTRACT(HOUR FROM start_date) BETWEEN 6 AND 10\n",
    "       OR\n",
    "       EXTRACT(HOUR FROM start_date) BETWEEN 15 AND 18))\n",
    ";\n",
    "\n",
    "WITH benchmark AS (\n",
    "SELECT \n",
    "   start_station_name,\n",
    "   is_peak_hour(start_date) AS is_peak,\n",
    "   AVG(duration) AS predicted_duration,\n",
    "FROM `bigquery-public-data.london_bicycles.cycle_hire`\n",
    "GROUP BY 1, 2\n",
    ")\n",
    "\n",
    "SELECT\n",
    "   SQRT( SUM( (duration - predicted_duration)*(duration - predicted_duration)) / COUNT(duration) ) AS rmse\n",
    "FROM `bigquery-public-data.london_bicycles.cycle_hire` c\n",
    "JOIN benchmark b\n",
    "ON c.start_station_name = b.start_station_name AND is_peak_hour(c.start_date) = b.is_peak"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "environment": {
   "name": "tf2-gpu.2-1.m54",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-1:m54"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
